Abstract

Conditions are given under which one may prove that the stochastic
dynamics of on-line learning can be described by the deterministic
evolution of a finite set of order parameters in the thermodynamic
limit. A global constraint on the average magnitude of the increments
in the stochastic process is necessary to ensure self-averaging. In the
absence of such a constraint, convergence may only be in probability.

On-line learning, introduced in [1], has become an important paradigm
in the analysis of neural networks. Not only has it enabled the
understanding of specific algorithms for a wide range of supervised learning
scenarios and network architectures, e.g. [3], [4], [5], but one may also
derive learning algorithms which are highly optimized for a specific problem,
e.g. [6], [7]. Furthermore it is also possible to analyze unsupervised
learning within this framework [8], [9].

The key assumption in on-line learning is that the adaption of the network 
is driven, at each time step, by the presentation of a single pattern which 
is picked independently of any previous patterns. Thus the evolution of the 
state vector of the network is governed by a stochastic (Markov) process. 
However, if the underlying distribution of patterns is not too complicated, 
it is possible to characterize the performance of the network by a few order
parameters, and one expects these parameters to be self-averaging for large
networks. This makes it possible to map the stochastic evolution of the state 
vector onto a deterministic evolution of the order parameters, thus greatly 
facilitating a theoretical understanding. 

While the self-averaging properties of the order parameters may usually 
be well observed numerically, a rigorous proof of this crucial fact has so far 
been lacking. The goal of this paper is to give conditions on the stochastic 
dynamics which ensure that it may be described by deterministic order pa- 
rameters in the thermodynamic limit and to clarify the nature of the conver- 
gence. We first review the customary heuristic derivation of the deterministic 
equations in the context of the perceptron learning rule. Next, a framework 
for the analysis of on-line learning is established which is general enough to 
cover many of the scenarios discussed in the literature. Within this frame- 
work we prove convergence by exploiting the fact that the thermodynamic 
limit is in some ways analogous to a small step size limit in order parameter 
space. (The small step size limit in weight space has been considered in [10].)
Of course some assumptions about the stochastic process are needed for the
proof, and the concluding paragraphs discuss examples to show that these
assumptions, while not being necessary for convergence, are nevertheless quite

To fix ideas let us first consider the perceptron learning rule. We are given
a sequence of inputs $\xi^\mu \in \mathbb{R}^N$ and corresponding outputs
$s^\mu \in \{-1, 1\}$ and we assume that the inputs are picked independently
from some probability distribution. We hope to approximate the mapping from
input to output by a perceptron $s_w$ which implements the function
$s_w(\xi) = \mathrm{sign}(w^T \xi)$ for $w, \xi \in \mathbb{R}^N$. To this
end, we use a new example $(\xi^\mu, s^\mu)$ to update our current estimate
of a good weight vector $w^\mu$ by

$$ w^{\mu+1} = w^\mu + \frac{\eta}{N}\,\frac{1}{2}\bigl(s^\mu - s_{w^\mu}(\xi^\mu)\bigr)\,\xi^\mu , \tag{1} $$

where $\eta = \eta(\mu/N)$ is a possibly time dependent learning rate. For
simplicity we assume that the output is itself given by a perceptron with
weight vector $B$, $s^\mu = s_B(\xi^\mu)$. Then the quantity of interest is
the angle between $w^\mu$ and $B$ which may be calculated from the overlaps
$r^\mu = B^T w^\mu$ and $q^\mu = w^{\mu T} w^\mu$. One easily finds recursive
equations for $r^{\mu+1}$ and $q^{\mu+1}$ using (1). These will of course
still depend on the entire input sequence but if we assume that the input
components are picked independently from the normal distribution, it is
straightforward to average over the last input and find:



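$$ \begin{aligned}
\langle r^{\mu+1} \rangle_{\xi^\mu} &= r^\mu + \frac{\eta}{N}\,\frac{1 - r^\mu/\sqrt{q^\mu}}{\sqrt{2\pi}} , \\
\langle q^{\mu+1} \rangle_{\xi^\mu} &= q^\mu + \frac{\eta}{N}\left( 2\,\frac{r^\mu - \sqrt{q^\mu}}{\sqrt{2\pi}} + \eta\,\frac{\arccos(r^\mu/\sqrt{q^\mu})}{\pi} \right) .
\end{aligned} \tag{2} $$

Since $r^\mu$ and $q^\mu$ are still stochastic quantities the above equations
do not seem very helpful. What one would really like to calculate is averages
over the entire sequence of inputs up to time $\mu$, for instance:

$$ \langle r^\mu \rangle_\mu = \langle r^\mu \rangle_{\xi^0, \ldots, \xi^{\mu-1}} . \tag{3} $$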

At this point it is customary to argue that in the thermodynamic limit,
$N \to \infty$ but $\mu = O(N)$, the overlaps will be self-averaging and that
$r^\mu$ will thus be close to $\langle r^\mu \rangle_\mu$ for large $N$.
Substituting the averages $(\langle r^\mu \rangle_\mu, \langle q^\mu \rangle_\mu)$
for the stochastic quantities $(r^\mu, q^\mu)$ in the iteration (2) leads to
deterministic finite difference equations, and, taking the large $N$ limit
once again, one arrives at the set of differential equations:

$$ \begin{aligned}
\dot r &= \eta(t)\,\frac{1 - r/\sqrt{q}}{\sqrt{2\pi}} , \\
\dot q &= 2\eta(t)\,\frac{r - \sqrt{q}}{\sqrt{2\pi}} + \eta(t)^2\,\frac{\arccos(r/\sqrt{q})}{\pi} .
\end{aligned} \tag{4} $$

One now claims that for large $N$ and identical or similar initial conditions
$(r^{tN}, q^{tN})$ will be close to $(r(t), q(t))$.
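
The heuristic claim above can be probed numerically. The following is a minimal sketch, not the authors' code: it runs the perceptron rule (1) with independent standard normal input components and integrates (4) with a plain Euler scheme. The teacher normalization $|B| = 1$, the constant learning rate, the initial condition $q(0) = 1$, the step size `dt` and all function names are assumptions made for illustration.

```python
import numpy as np

def simulate_perceptron(N, t_max, eta=1.0, seed=0):
    """Run rule (1) for round(t_max * N) steps.

    Returns the overlaps (r, q) before and after the run.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)      # assumption: teacher normalized, |B| = 1
    w = rng.standard_normal(N)
    w /= np.linalg.norm(w)      # assumption: start with q(0) = 1
    start = (float(B @ w), float(w @ w))
    for _ in range(int(round(t_max * N))):
        xi = rng.standard_normal(N)            # one Gaussian input
        s_teacher = np.sign(B @ xi)            # target output
        s_student = np.sign(w @ xi)            # current student output
        w += (eta / N) * 0.5 * (s_teacher - s_student) * xi
    return start, (float(B @ w), float(w @ w))

def integrate_order_parameters(r, q, t_max, eta=1.0, dt=1e-3):
    """Euler integration of the differential equations (4)."""
    for _ in range(int(round(t_max / dt))):
        rho = np.clip(r / np.sqrt(q), -1.0, 1.0)   # cosine of the angle
        dr = eta * (1.0 - rho) / np.sqrt(2.0 * np.pi)
        dq = (2.0 * eta * (r - np.sqrt(q)) / np.sqrt(2.0 * np.pi)
              + eta**2 * np.arccos(rho) / np.pi)
        r, q = r + dt * dr, q + dt * dq
    return r, q
```

Calling `simulate_perceptron(N, t_max)` and then feeding the returned initial overlaps to `integrate_order_parameters` gives the two pairs $(r^{tN}, q^{tN})$ and $(r(t), q(t))$ to compare; for a single run the agreement is only expected to improve as $N$ grows.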

The main goal of this paper is to make such claims precise and give
conditions under which they are rigorously true. Since we do not want to
confine ourselves to the perceptron (the theory should, e.g., cover learning
in multilayer perceptrons as well), we introduce a somewhat more general
framework. Consider an iteration of the form

$$ J^{\mu+1} = J^\mu + N^{-1} f(J^\mu, \xi^\mu) , \tag{5} $$

where the patterns $\xi^\mu$ are picked independently from a probability
distribution on $\mathbb{R}^{L_N}$. The state vectors $J^\mu$ and the
increments $f$ lie in some $\mathbb{R}^{M_N}$. The standard case in on-line
learning is that the input dimension $L_N$ and the system size $M_N$ are on
the order of $N$. Further let

$$ Q : \mathbb{R}^{M_N} \times \mathbb{R}^{M_N} \to V \subset \mathbb{R}^n \tag{6} $$

be a symmetric, bilinear mapping. The intended interpretation is that
$Q(J^\mu, J^\mu)$ gives the order parameters of the problem.

It might seem that some order parameters, such as $r^\mu$ in the above
perceptron rule, cannot be obtained by applying a quadratic form to the state
vector. However, by using a larger state vector this can always be achieved,
as well as for instance the transformation of a nonautonomous system into an
autonomous one. In the perceptron case one may formally augment equation (1)
with the following set of equations:

$$ b^{\mu+1} = b^\mu , \quad \tau_1^{\mu+1} = \tau_1^\mu + 1/N , \quad \tau_2^{\mu+1} = \tau_2^\mu \tag{7} $$

and fix the initial conditions for the new recursions by $b^0 = B$,
$\tau_1^0 = 0$, $\tau_2^0 = 1$. By aggregating $w, b, \tau_1, \tau_2$ to form
a vector $J$ of dimension $2N + 2$ the set of equations (1,7) is of the
general form (5). Furthermore we define a bilinear symmetric form taking
values in $\mathbb{R}^3$ via $Q(J, J) = (w^T w, b^T w, \tau_1 \tau_2)$, where
$J = (w, b, \tau_1, \tau_2)$. Now $Q(J^\mu, J^\mu) = (q^\mu, r^\mu, \mu/N)$
and since we thus obtain the order parameters of the perceptron rule (1),
this rule is indeed just a special case of the general framework.

Before proceeding with the general theory, a word of caution regarding
our notation is in order. We are of course not considering a single stochastic
process but a sequence of these. But to reduce notational overhead we have
suppressed the index $N$ in symbols such as $J^\mu, \xi^\mu, f, Q$ and the
factor $N^{-1}$ in (5) is just an attempt at suggestive notation. It is,
however, crucial that the number $n$ of order parameters (6) and the set of
their possible values $V$ be independent of $N$.

Writing $Q(J, \tilde J)$ more conveniently as $J * \tilde J$, the iteration
(5) yields the following relation for the order parameters
$Q^\mu = J^\mu * J^\mu$:

$$ \begin{aligned}
Q^{\mu+1} &= Q^\mu + N^{-1} F(J^\mu, \xi^\mu) , \qquad (8) \\
F(J, \xi) &= 2 J * f(J, \xi) + N^{-1} f(J, \xi) * f(J, \xi) . \qquad (9)
\end{aligned} $$

For the $Q^\mu$ to be the order parameters of the problem, the input average
of the increment function $F(J, \xi)$ should for large $N$ converge to a
quantity which only depends on $J * J$. At this point we shall just write

$$ \langle F(J, \xi) \rangle_\xi \to G(J * J) \tag{10} $$

and be more precise about the kind of convergence later. That the limit
$N \to \infty$ is not just the limit of small step size but a thermodynamic
one, in which the system size $M_N$ and the input size $L_N$ may diverge, is
important in the definition of $G$: The $N^{-1}$ term in the increment
function $F$ given by (9) will in general give a finite contribution to $G$.
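
The bookkeeping in (5) to (9) can be checked numerically for the augmented perceptron. The sketch below is illustrative only: the off-diagonal part of `Q` (obtained by symmetrizing the quadratic form given in the text), the constant learning rate and all function names are assumptions. It is meant to confirm that one step of (5) changes the order parameters exactly as stated in (8) and (9).

```python
import numpy as np

def Q(Ja, Jb):
    """Symmetric bilinear form with Q(J, J) = (w.w, b.w, tau1 * tau2).

    Only the diagonal values are specified in the text; the symmetrized
    off-diagonal extension used here is an assumption.
    """
    wa, ba, tau1a, tau2a = Ja
    wb, bb, tau1b, tau2b = Jb
    return np.array([wa @ wb,
                     0.5 * (ba @ wb + bb @ wa),
                     0.5 * (tau1a * tau2b + tau1b * tau2a)])

def f(J, xi, eta=1.0):
    """Increment of the augmented state J = (w, b, tau1, tau2), cf. (1), (7)."""
    w, b, tau1, tau2 = J
    dw = eta * 0.5 * (np.sign(b @ xi) - np.sign(w @ xi)) * xi
    return (dw, np.zeros_like(b), 1.0, 0.0)

def step(J, xi, N):
    """One iteration of (5): J -> J + f(J, xi) / N."""
    return tuple(a + c / N for a, c in zip(J, f(J, xi)))

def F(J, xi, N):
    """Increment function (9): F = 2 J*f + f*f / N."""
    inc = f(J, xi)
    return 2.0 * Q(J, inc) + Q(inc, inc) / N
```

With these definitions, `Q(step(J, xi, N), step(J, xi, N))` and `Q(J, J) + F(J, xi, N) / N` should agree up to floating point error. The second-order term `Q(inc, inc) / N` is the one that contributes the $\arccos$ term of (4) in the large $N$ limit.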

We may now associate to the stochastic process (8) the deterministic
trajectory

$$ \dot Q = G(Q) \tag{11} $$

and ask whether $Q^{tN}$ converges to $Q(t)$ for large $N$. Here and in the
sequel, the convention is that a real expression is rounded when it appears
in the position of an integer index, like $tN$ in $Q^{tN}$. For a given
initial condition $Q(0)$, we assume that (11) has a solution $Q(t)$ up to a
certain time $a$. We further require the existence of a compact set $U$ which
contains a neighborhood of the trajectory, more formally

$$ \{ x \in V : |x - Q(t)| < \epsilon \} \subset U . \tag{12} $$

Note that here and in the sequel we assume $0 \le t \le a$.

We are now in the position to state conditions for the convergence of the
stochastic process to the deterministic trajectory:

(a) $|G(Q_1) - G(Q_0)| \le C |Q_1 - Q_0|$ for $Q_1, Q_0 \in U$ and some
constant $C$.

(b) $|Q^0 - Q(0)| \le l(N)$,

$|\langle F(J, \xi) \rangle_\xi - G(J * J)| \le h(N)$ if $J * J \in U$,

for suitable functions $l$ and $h$ with
$\lim_{N \to \infty} l(N) = \lim_{N \to \infty} h(N) = 0$.

(c) $\langle |F(J, \xi)|^2 \rangle_\xi \le C^2 (|J * J| + 1)^2$; for
convenience we use the same constant $C$, independent of $N$, as in
condition (a).

The Lipschitz condition (a) makes sure that there is a unique deterministic
trajectory given the initial value $Q(0)$. Note that this condition is only
required to hold in the neighborhood $U$ of the deterministic trajectory.
Indeed, by considering e.g. the limit $q \to 0$ for the perceptron rule (4),
one sees that even for this simple case no global Lipschitz condition holds.
Condition (b) clarifies the required relationship between the stochastic
process and the deterministic trajectory: Initial conditions should converge
and so should the increments, at least on average and in the neighborhood
$U$. Perhaps the most interesting condition is (c). In the case of the
perceptron learning rule, the fourth moments of the input distribution must
exist for the LHS of (c) to be defined. Further the condition implies

(c') $|\langle F(J, \xi) \rangle_\xi| \le C (|J * J| + 1)$

and is thus a global constraint on the growth of the increments.

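Given these conditions, one may prove convergence by considering the
difference $\Delta^\mu = Q^\mu - Q(\mu/N)$ between the stochastic and the
deterministic trajectory. Using the abbreviation
$g^\mu = N (Q(\mu/N + 1/N) - Q(\mu/N))$ the following recursive relation for
the variance $\sigma^\mu$ of $\Delta^\mu$ is obtained from (8):

$$ \begin{aligned}
\sigma^{\mu+1} &= \langle |\Delta^{\mu+1}|^2 \rangle_{\mu+1}
 = \langle |\Delta^\mu + N^{-1} (F(J^\mu, \xi^\mu) - g^\mu)|^2 \rangle_{\mu+1} \\
 &= \sigma^\mu
 + \frac{2}{N} \langle \Delta^{\mu T} (F(J^\mu, \xi^\mu) - g^\mu) \rangle_{\mu+1} \\
 &\quad + \frac{1}{N^2} \langle |F(J^\mu, \xi^\mu)|^2 - 2 F(J^\mu, \xi^\mu)^T g^\mu + |g^\mu|^2 \rangle_{\mu+1} .
\end{aligned} \tag{13} $$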

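The next step is to find an upper bound on the increments to $\sigma^\mu$
which depends only on $\sigma^\mu$ itself. For the $2/N$-term in the above
equation we need to distinguish between $Q^\mu$ being in $U$ or not. So we
rewrite this term as:

$$ \begin{aligned}
\langle \Delta^{\mu T} (F(J^\mu, \xi^\mu) - g^\mu) \rangle_{\mu+1} =\;
 & \langle (1 - \Theta(|\Delta^\mu| - \epsilon))\, \Delta^{\mu T} (\langle F(J^\mu, \xi) \rangle_\xi - g^\mu) \rangle_\mu \\
 & + \langle \Theta(|\Delta^\mu| - \epsilon)\, \Delta^{\mu T} (\langle F(J^\mu, \xi) \rangle_\xi - g^\mu) \rangle_\mu .
\end{aligned} \tag{14} $$

In the first summand one rewrites the difference as:

$$ \begin{aligned}
\langle F(J^\mu, \xi) \rangle_\xi - g^\mu =\;
 & \langle F(J^\mu, \xi) \rangle_\xi - G(Q^\mu) \\
 & + G(Q^\mu) - G(Q(\mu/N)) \\
 & + G(Q(\mu/N)) - g^\mu .
\end{aligned} \tag{15} $$

Expanding the product of $\Delta^{\mu T}$ with the above RHS, applying the
triangle inequality and then (b) and (a) one obtains [12] an upper bound:

$$ \langle (1 - \Theta(|\Delta^\mu| - \epsilon))\, \Delta^{\mu T} (\langle F(J^\mu, \xi) \rangle_\xi - g^\mu) \rangle_\mu
 \le (h(N) + C^2/N) \sqrt{\sigma^\mu} + C \sigma^\mu . \tag{16} $$

To bound the second term in (14) one uses that the growth condition (c)
implies

$$ |\Delta^{\mu T} (\langle F(J^\mu, \xi) \rangle_\xi - g^\mu)| \le |\Delta^\mu|^2 C + |\Delta^\mu| (C^2 + C) \tag{17} $$

and thus:

$$ \langle \Theta(|\Delta^\mu| - \epsilon)\, \Delta^{\mu T} (\langle F(J^\mu, \xi) \rangle_\xi - g^\mu) \rangle_\mu
 \le C \sigma^\mu + (C^2 + C) \langle \Theta(|\Delta^\mu| - \epsilon) |\Delta^\mu| \rangle_\mu . \tag{18} $$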



The remaining average can be further simplified by applying a Tschebyscheff
inequality: $\langle \Theta(|\Delta^\mu| - \epsilon) |\Delta^\mu| \rangle_\mu \le \sigma^\mu / \epsilon$.
Using the growth condition (c) one may bound the $1/N^2$ term in (13) and
combining this with (16,18) finally yields:

$$ \begin{aligned}
A/N &\ge \frac{\sigma^{\mu+1} - \sigma^\mu}{u(\sigma^\mu)} , \\
u(\sigma) &= (h(N) \sqrt{\sigma} + \sigma) + N^{-1} (1 + \sqrt{\sigma} + \sigma) ,
\end{aligned} \tag{19} $$

for a suitable positive $A$ which depends only on $C$ and $\epsilon$. Note
that the bound (19) holds for any $\mu$ and $N$. By rewriting its RHS as an
integral and summing over $\mu$ one finds
$\int_{\sigma^0}^{\sigma^\mu} u(\sigma)^{-1} d\sigma \le A \mu / N$. Replacing
the term $\sqrt{\sigma}$ in $u(\sigma)$ by its value at the upper limit
$\sigma^\mu$ makes the integral both smaller and simpler and in the end
yields the following key inequality:

$$ \sigma^\mu \le 4 (N^{-1} + l(N)^2 + h(N)^2) \exp(4 A \mu / N) . \tag{20} $$

Consequently for $\mu = tN$ the variance decays to zero in the large $N$
limit, the probability of $Q^{tN}$ deviating from the sequence average
$\langle Q^{tN} \rangle_{tN}$ vanishes and the stochastic process is
self-averaging. [13]
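
The decay of the fluctuations with $N$ can be illustrated for the perceptron rule (1). The sketch below is an assumption-laden illustration, not part of the proof: it measures the mean squared deviation of $(r, q)$ from the sample mean over independent runs (a stand-in for $\sigma^{tN}$, which in the text is measured relative to the deterministic trajectory), uses a teacher with $|B| = 1$, a constant learning rate and the common initial condition $r = 0$, $q = 1$. The function name and parameters are hypothetical.

```python
import numpy as np

def deviation_second_moment(N, t=1.0, runs=200, eta=1.0, seed=0):
    """Mean squared deviation of (r, q) from its sample mean at time t.

    All runs start from the same state and see independent Gaussian inputs.
    Requires N >= 2.
    """
    rng = np.random.default_rng(seed)
    B = np.zeros(N)
    B[0] = 1.0                       # assumption: teacher with |B| = 1
    W = np.zeros((runs, N))
    W[:, 1] = 1.0                    # same start in every run: r = 0, q = 1
    for _ in range(int(round(t * N))):
        Xi = rng.standard_normal((runs, N))          # one input per run
        s_teacher = np.sign(Xi @ B)
        s_student = np.sign(np.einsum("ij,ij->i", W, Xi))
        W += (eta / N) * 0.5 * (s_teacher - s_student)[:, None] * Xi
    order_params = np.stack([W @ B, np.einsum("ij,ij->i", W, W)], axis=1)
    dev = order_params - order_params.mean(axis=0)   # deviation from mean
    return float(np.mean(np.sum(dev**2, axis=1)))
```

Evaluating this for increasing `N` at fixed `t` should show the measured value shrinking, in line with the $N^{-1}$ term of (20) when $l(N) = h(N) = 0$.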

Let us next consider relaxing the global constraint (c). Assume a situation
where (c) holds for $J * J \in U$ but not necessarily outside of $U$. We may
then replace the update rule $f$ in (5) by

$$ \hat f(J, \xi) = \begin{cases} f(J, \xi) & \text{if } J * J \in U , \\ 0 & \text{otherwise.} \end{cases} \tag{21} $$

Then, for identical initial conditions, the deterministic trajectory of $f$
and $\hat f$ will be the same. Further all of the conditions (a-c) hold for
$\hat f$ and this stochastic process is self-averaging. Since the
deterministic trajectory lies strictly in the interior of $U$ and since the
increments $\hat f$ are zero outside of $U$, this implies that the
probability of $Q^{tN}$ (as generated by $\hat f$) not lying in $U$ (for any
$t \in [0, a]$) must vanish for large $N$. So given the same input sequence
$\hat f$ will typically give the same result as the unmodified dynamics $f$,
and thus $Q^{tN}$ converges to $Q(t)$ in probability. Thus, for this weaker
notion of convergence, no global constraint is needed.

To be able to conclude, however, that in such a situation the stochastic
process given by $f$ is self-averaging, we would have to know that $Q^{tN}$
is well behaved on untypical sequences as well. A simple example will be
sufficient to show that this need not be the case. Consider the following
random walk with a step size which depends on the length of the current
vector:

$$ J^{\mu+1} = J^\mu + N^{-1} (Q^\mu - 1)\, \xi^\mu . \tag{22} $$

Here we assume $J^\mu, \xi^\mu \in \mathbb{R}^N$, $Q^\mu = |J^\mu|^2$ and the
components of the $\xi^\mu$ are picked independently from the normal
distribution. Averaging the self-overlap $Q^{\mu+1}$ with respect to the
last input yields

$$ \langle Q^{\mu+1} \rangle_{\xi^\mu} = Q^\mu + N^{-1} (Q^\mu - 1)^2 \tag{23} $$

and condition (c) is violated. The deterministic trajectory is given by
$\dot Q = (Q - 1)^2$ and while it is defined for all times if $Q(0) \le 1$,
it will diverge at some finite time if the initial condition has $Q(0) > 1$.
Consequently one will expect $Q^{tN}$ to diverge with $N$ if $Q^0$ is greater
than 1 and $t$ is sufficiently large. To obtain a lower bound on this
divergence, first note that by convexity of the RHS in (23)

$$ \langle Q^{\mu+1} \rangle_{\mu+1} \ge \langle Q^\mu \rangle_\mu + N^{-1} (\langle Q^\mu \rangle_\mu - 1)^2 . \tag{24} $$

Setting $\tilde Q^\mu = \langle Q^\mu \rangle_\mu - 1$ yields
$\tilde Q^{\mu+1} \ge \tilde Q^\mu + N^{-1} (\tilde Q^\mu)^2$ and dropping the
first summand allows us to solve the recurrence and find

$$ \tilde Q^\mu \ge (\tilde Q^0 / N)^{2^\mu} . \tag{25} $$

Thus $\tilde Q^\mu$ (and $\langle Q^\mu \rangle_\mu$) will increase
super-exponentially with $\mu$ if ever $\tilde Q$ becomes larger than $N$.

Let us now consider the dynamics for an initial condition $Q^0 = Q(0) = 0$.
By the general theory presented above $Q^{tN}$ will with increasing $N$
converge in probability to $Q(t)$ for any fixed $t$. There is, however, a
small probability of making a large first step. In particular, the
probability of having $Q^1 > N$ is larger than $\exp(-P(N))$, where $P(N)$ is
a suitable polynomial in $N$. Whenever such a rare event ($Q^1 > N$) occurs,
due to (25) the following steps lead to an extremely fast growth.
Consequently $\langle Q^{tN} \rangle_{tN}$ diverges with $N$ for any positive
$t$ and the stochastic dynamics is not self-averaging in the thermodynamic
limit.
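
Both sides of this example can be explored with a short script. The sketch below is illustrative only and its function names are assumptions. It simulates the random walk (22) from $J^0 = 0$ on a typical input sequence and compares with the solution of $\dot Q = (Q - 1)^2$ for $Q(0) = 0$, which by separation of variables is $Q(t) = t/(1+t)$. It also iterates the recursion $x \to x + x^2/N$, i.e. the bound derived from (24) taken with equality, to show how quickly the average grows once it is of order $N$.

```python
import numpy as np

def simulate_random_walk(N, t, seed=0):
    """Iterate (22) from J = 0 for round(t * N) steps; return Q = |J|^2."""
    rng = np.random.default_rng(seed)
    J = np.zeros(N)
    for _ in range(int(round(t * N))):
        Q = J @ J
        J = J + (Q - 1.0) * rng.standard_normal(N) / N
    return float(J @ J)

def deterministic_Q(t):
    """Solution of dQ/dt = (Q - 1)^2 with Q(0) = 0."""
    return t / (1.0 + t)

def steps_to_exceed(q_tilde, N, threshold=1e100):
    """Count iterations of x -> x + x^2 / N until x exceeds threshold.

    Starts from a positive value q_tilde. The default threshold keeps
    x**2 well inside the range of double precision floats.
    """
    if q_tilde <= 0.0:
        raise ValueError("q_tilde must be positive")
    steps = 0
    while q_tilde <= threshold:
        q_tilde = q_tilde + q_tilde**2 / N
        steps += 1
    return steps
```

A single call such as `simulate_random_walk(2000, 1.0)` samples a typical sequence and is expected to land near `deterministic_Q(1.0)`; the rare large first steps that spoil the sequence average are far too improbable to be seen in such a run, which is exactly the gap between convergence in probability and self-averaging.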

While we believe that the conditions imposed on the stochastic process
are not overly restrictive, they are not necessary for self-averaging to hold.
A case in point is the perceptron rule (1). While conditions (a-c) are true
for an initial condition with $q(0) > 0$, the Lipschitz condition (a) is
violated for $q(0) = 0$ and it is not possible to define $r/\sqrt{q}$ in a
manner that would make the RHS of (4) continuous in the point $q(0) = 0$.
Nevertheless, numerical simulations indicate that the perceptron rule is
self-averaging for this initial condition. But this property is highly
dependent on fine details of the input distribution: Instead of always
choosing Gaussian inputs, consider presenting a Gaussian input only in one
half of the steps and else presenting some fixed vector. More formally, let
the input $\tilde\xi$ be the random variable
$\tilde\xi = x \xi + (1 - x) N^{-1} b$, where $\xi$ has normally distributed
components, $x$ is 0 or 1 with equal probability and $b$ is a fixed vector
with $|b| \le 1$. The deterministic trajectory does not depend on the
specific choice of $b$ and is, up to a factor of 1/2, still given by (4). If
we choose $b = 0$, the stochastic process is essentially the same as for
plain Gaussian inputs, except that, on average, the weight vector does not
change in every second step. But now consider the choice $b = B$ and assume
that $\mathrm{sign}(0) = 0$. For $q^0 = 0$ in one half of the cases perfect
generalization will be achieved in the first step and subsequently the weight
vector will not change. However, if the initial input is Gaussian
($x^0 = 1$), any subsequent presentation of $B$ as input will not change the
weight vector since we already have positive overlap with the teacher and the
subsequent dynamics will be the same as for the choice $b = 0$. So, for
$b = B$, the first step is crucial and the on-line dynamics is not
self-averaging.

While the above example is rather contrived, it does nevertheless show
that the self-averaging properties of on-line learning can depend on rather
minute details of the stochastic process if the Lipschitz condition is violated.
Consequently we believe that it is difficult to find easily verifiable conditions
for self-averaging which are much weaker than the ones presented here.

The authors wish to acknowledge helpful discussions with M. Biehl, W. 
Kinzel, and M. Opper. The work of one of us (R.U.) was supported by the 
Deutsche Forschungsgemeinschaft.